
Note 5 Relational Database Design and Normalization

Author: Zhen Tong 120090694@link.cuhk.edu.cn Lecturer: Jan YOU

Before Everything

This note will mainly focus on what makes a suitable schema.

A Good Design

Before learning what makes a good design, let's first see what a bad one looks like.

Update Anomaly

Take the example from the lecture:

If we change to another design like this:

EMP_PROJ(Ssn, Pnumber, Ename, Pname, No_hours)

We combine the Employee information with the Project information, which can be a bad design: now, when we want to change the name of this project from "Billing" to "Customer Accounting" (presumably because the project's focus has changed), this update operation may have unintended consequences.

The anomaly is that to make this change for "P1" we must update the project name in every row of every employee working on "P1"; if any of those rows is missed, the table will contain both "Billing" and "Customer Accounting" for the same project. This is an anomaly because the update operation affects far more data than was initially intended.

This problem is called Update Anomaly.

Insertion Anomaly

The same schema can also cause an insertion anomaly:

  1. Cannot naturally insert a project unless an employee is assigned to it
  1. Conversely, cannot insert an employee unless he/she is assigned to a project

In other words, you cannot add information about an employee to the database unless that employee is already assigned to a project, and vice versa: employee records and project records depend on each other through the assignment.

Redundancy

Besides insertion and update anomalies, redundancy is also a problem. Look at the table below:

In the EMP_PROJ table, the Ename, Pname, and Plocation are stored redundantly. Someone may object: "These data are not redundant, they are useful! You can access them directly without a join, so it saves time!"

To answer this question, let's do some math. Suppose you have $M$ employees and $N$ projects; then the amount of redundant data is $3MN$ values. Whenever you add one new project or employee, you need up to another $3M$ or $3N$ cells, respectively, to store the redundant copies. In the original design, however, only $O(1)$ extra space is used.
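
To make this concrete, take hypothetical numbers, say $M = 100$ employees and $N = 20$ projects (these figures are made up purely for illustration):

$$3MN = 3 \times 100 \times 20 = 6000 \text{ redundant stored values.}$$

Adding just one more project then forces up to $3M = 300$ extra redundant copies of Ename, Pname, and Plocation, while the decomposed design stores the new project's data once, i.e. $O(1)$ extra space.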

First Normal Form

Now that we know what a bad design is, let's introduce the main idea of normalization in this note.

In 1NF, a relational schema is organized in a specific way to ensure that data is stored and retrieved efficiently and without redundancy.

The key idea of the First Normal Form is the concept of an "atomic" domain, which means that the values in a column (attribute) of a relational table should be indivisible or considered as single, discrete units. In other words, each value in a column should be a simple, elementary piece of data, not a collection of related or grouped data.

For example, the university has courses like "csc3170" and "mat2050": the first letters are the subject, and the digits are a unique number within that subject. This is not an atomic domain, because the value can be further decomposed.
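
To see the non-atomicity concretely, here is a tiny sketch (the course codes are the ones mentioned above; the three-letter split is an assumption about their format):

```python
# A course code such as "csc3170" packs two facts into one value:
# the subject ("csc") and the course number ("3170").
# Being able to split it meaningfully is the sign that the domain is not atomic.
for code in ["csc3170", "mat2050"]:
    subject, number = code[:3], code[3:]
    print(subject, number)   # csc 3170 / mat 2050
```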

Also, let's see how we can improve the EMP_PROJ example we had above.

Do you still remember the update problem we had? We decompose EMP_PROJ into relations EMP_PROJ1 and EMP_PROJ2 by propagating the primary key; then we can update either of them independently.
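
As a minimal sketch of why this helps, assume one plausible split in which EMP_PROJ2 keeps the project name keyed by Pnumber (the attribute names follow the lecture's example; the exact split and the sample rows are my own illustration):

```python
# EMP_PROJ1(Ssn, Pnumber, No_hours): who works on which project, and for how long.
emp_proj1 = [
    {"Ssn": "111", "Pnumber": "P1", "No_hours": 20},
    {"Ssn": "222", "Pnumber": "P1", "No_hours": 15},
    {"Ssn": "333", "Pnumber": "P2", "No_hours": 40},
]

# EMP_PROJ2(Pnumber, Pname): each project's name is stored exactly once.
emp_proj2 = {"P1": "Billing", "P2": "Payroll"}

# Renaming project P1 is now a single update; no employee rows need touching,
# so there is no risk of leaving stale copies of the old name behind.
emp_proj2["P1"] = "Customer Accounting"
print(emp_proj2)  # {'P1': 'Customer Accounting', 'P2': 'Payroll'}
```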

Functional Dependency and Decomposition

From the two examples above, we know that normalization so far always decomposes a larger design into smaller ones. But is the decomposed structure always a good design? What problems can it cause, and how can we detect them?

First, let's define an important relationship among attributes: the functional dependency.

A functional dependency is denoted as $X \rightarrow Y$, where X is the determinant attribute(s) and Y is the dependent attribute(s). This means that for every combination of values in X, there is exactly one corresponding value of Y.

Here's an example to illustrate functional dependency:

Let's say we have a table called "Students" with the following attributes:

In this scenario, we can observe functional dependencies:

  1. Student_ID → Student_Name:
    • Every unique Student_ID is associated with a unique Student_Name. For a given Student_ID, there is only one corresponding Student_Name.
  1. Student_ID → Student_Age:
    • Similarly, each Student_ID is associated with a unique Student_Age. For a given Student_ID, there is only one corresponding Student_Age.
  1. Student_ID → Student_Grade:
    • Again, there is a functional dependency. Each Student_ID is associated with a unique Student_Grade.
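
A small way to check claims like these on concrete data: the helper below tests whether an FD X → Y holds in a table represented as a list of rows (the function name and the sample rows are my own, not from the lecture):

```python
def fd_holds(rows, lhs, rhs):
    """Return True if every combination of lhs values maps to a single rhs value."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        if x in seen and seen[x] != y:
            return False  # same X values but different Y values: FD violated
        seen[x] = y
    return True

students = [
    {"Student_ID": 1, "Student_Name": "Ann", "Student_Age": 20, "Student_Grade": "A"},
    {"Student_ID": 2, "Student_Name": "Bob", "Student_Age": 21, "Student_Grade": "B"},
    {"Student_ID": 3, "Student_Name": "Ann", "Student_Age": 22, "Student_Grade": "A"},
]

print(fd_holds(students, ["Student_ID"], ["Student_Name"]))   # True
print(fd_holds(students, ["Student_Name"], ["Student_Age"]))  # False: two Anns, different ages
```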

Lossless Decomposition

Then, let's come back to the question: "Is the decomposed structure always a good design?" Well, the answer is no; only a lossless decomposition is good.

For example

Suppose we decompose the employee schema into two relations for storage. When we need the combined data, we join them back; then, if two people have identical names, the join result will contain $|A|\times|B| = 4$ tuples, while the right answer is 2, because there are only 2 people.

Formally, suppose the relation schema is $R$ and we decompose it into $R_1$ and $R_2$ with $R = R_1\cup R_2$, and later join them back. Any relation instance $r$ of $R$, decomposed into $\Pi_{R_1}(r)$ and $\Pi_{R_2}(r)$, should join back to exactly $r$, i.e. $\Pi_{R_1}(r)\bowtie\Pi_{R_2}(r) = r$. Conversely, a decomposition is lossy if $|r|<|\Pi_{R_1}(r)\bowtie\Pi_{R_2}(r)|$.
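
Here is a minimal sketch of the lossy case just described, with invented names and values; splitting on the non-key attribute Name and joining back produces spurious tuples:

```python
# Two different people happen to share the name "Kim".
employee = [
    ("Kim", "Toys", 48000),
    ("Kim", "HR",   53000),
]

# Decompose on Name (not a key!): R1(Name, Dept) and R2(Name, Salary).
r1 = {(name, dept) for name, dept, salary in employee}
r2 = {(name, salary) for name, dept, salary in employee}

# Natural join back on Name.
joined = {(n1, dept, salary) for (n1, dept) in r1 for (n2, salary) in r2 if n1 == n2}

print(len(employee), len(joined))  # 2 original tuples, 4 after the join -> lossy
```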

Closure

Closure is a set-based concept, and I want to spend some extra time and effort to define everything clearly.

A "closure" refers to the set of all functional dependencies that can be logically implied or derived from a given set of functional dependencies. The closure helps us determine all the relationships between attributes (columns) in a database based on the initial set of dependencies.

Here's an explanation of the concept:

  1. Set of Functional Dependencies (F): This is the initial set of functional dependencies that you have in a database. Each functional dependency in this set consists of an implication between two sets of attributes. For example, "A -> B" means that attribute A determines attribute B.
  1. Closure of F (F+): The closure of F, denoted as F+, is the set of all functional dependencies that can be logically derived from the initial set F. In other words, F+ contains all the dependencies that are implied by the transitive and reflexive properties of functional dependencies.
    • Transitive Property: If you have functional dependencies A -> B and B -> C, you can infer the dependency A -> C. This means that if A determines B, and B determines C, then A determines C.
    • Reflexive Property: Every attribute determines itself, so if you have a dependency A, you can consider it as A -> A.

By applying these properties, you can determine all the implied functional dependencies within the database. The closure F+ ensures that you have a complete and consistent understanding of how attributes depend on one another in your database schema.

Here's an example to illustrate the concept:

Suppose you have the following set of functional dependencies:

From these dependencies, you can derive the following implied dependencies:

So, in this case, the closure F+ includes:

The closure F+ provides a comprehensive understanding of the relationships between attributes based on the initial set of functional dependencies.
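
One standard way to compute what is implied by F is the attribute-closure algorithm: starting from a set of attributes, keep applying FDs whose left side is already covered. The sketch below (the function name and the sample FDs are mine) shows how A -> C falls out of A -> B and B -> C:

```python
def attribute_closure(attrs, fds):
    """Compute X+ for a set of attributes X under a list of (lhs, rhs) FDs."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closure and not rhs <= closure:
                closure |= rhs       # lhs is already determined, so rhs is too
                changed = True
    return closure

F = [({"A"}, {"B"}), ({"B"}, {"C"})]
print(attribute_closure({"A"}, F))   # {'A', 'B', 'C'}: so A -> C is implied
```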

Keys and Functional Dependencies

I hope you still remember the concepts of superkey, candidate key, primary key, and secondary key. Here is a recap:

Here we introduce some new concepts:

Trivial Functional Dependencies

A functional dependency $X \rightarrow Y$ is trivial if $Y \subseteq X$. It is called trivial because it doesn't tell you any extra information; the dependency is already self-contained in X.

For example, $\{Ssn, Ename\} \rightarrow Ename$ is a trivial FD.

Lossless Decomposition and Functional Dependency Property

From above, we know that a decomposition satisfying $r = \Pi_{R_1}(r)\bowtie\Pi_{R_2}(r)$ is a lossless decomposition. This is equivalent to the following property.

A decomposition of R into $R_1$ and $R_2$ is a lossless decomposition if at least one of the following dependencies is in the closure $F^+$ of the functional dependencies: $R_1\cap R_2 \rightarrow R_1$ or $R_1\cap R_2 \rightarrow R_2$.

Let me further explain this condition. First, $R_1\cap R_2$ is the set of attributes we match on when doing the join. I like to think of this part as the zipper teeth joining the two tables.

Either the zipper attributes must determine a unique right part $R_2$, or they must determine a unique left part $R_1$. Imagine what the joined table $R'$ would look like if one zipper value matched more than one counterpart: the zipper would gain extra teeth, and the rejoined "cloth" would no longer match the original.

Here is an example:

$R = (A, B, C)$, $F = \{A\rightarrow B,\ B\rightarrow C\}$

$R_1 = (A, B)$, $R_2 = (B, C)$

Because $R_1\cap R_2 = \{B\}$

and $B\rightarrow BC$ (that is, $R_1\cap R_2 \rightarrow R_2$) holds,

This is a lossless decomposition.
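
The same attribute-closure idea can mechanically verify this lossless-join test; here is a self-contained sketch (the function and variable names are mine):

```python
def closure(attrs, fds):
    result, changed = set(attrs), True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

F = [({"A"}, {"B"}), ({"B"}, {"C"})]
R1, R2 = {"A", "B"}, {"B", "C"}
common = R1 & R2                      # {'B'}: the "zipper" attributes

# Lossless if the common attributes determine all of R1 or all of R2.
print(R1 <= closure(common, F) or R2 <= closure(common, F))   # True
```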

Second Normal Form

Full functional dependency: an FD Y → Z where removal of any attribute from Y means the FD does not hold anymore.

A relation schema R is in second normal form (2NF) if it is in first normal form, and every nonprime attribute A in R is fully functionally dependent on the primary key (assuming it is the only candidate key).

What we improve when going from a 1NF design with partial FDs to a 2NF design with full FDs is, again, decomposition. We iteratively check the secondary attributes, separate out the partial FDs, and turn them into full FDs.

Eliminating Partial Functional Dependencies

2NF helps reduce data redundancy by ensuring that non-prime attributes depend on the entire primary key. This means that for a relation to be in 2NF, non-prime attributes are functionally dependent on the entire primary key, and there are no partial dependencies. By adhering to 2NF, you store data efficiently, without duplication.

Third Normal Form

Transitive functional dependency: an FD X→Z that can be derived from two FDs X → Y and Y → Z, where Y is a nonprime attribute

3NF:

A relation schema R is in third normal form (3NF) if it is in 2NF and no nonprime attribute A in R is transitively dependent on the primary key (assuming it is the only candidate key)

Let's look further at these two examples.

A tip for a 2NF normalization task is to look at the secondary attributes one by one and check whether each one involves a partial FD. Hours depends on both Ssn and Pnumber; Ename depends only on Ssn; and Pname and Plocation depend only on Pnumber. Therefore there are three groups of secondary attributes. Because there is no transitive functional dependency, we stop at 2NF.
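
Written out in schema notation, the resulting 2NF decomposition would look as follows (the relation names EP1 to EP3 are my own labels; the attribute grouping simply follows the dependencies listed above):

EP1(Ssn, Pnumber, Hours), with key {Ssn, Pnumber}

EP2(Ssn, Ename), with key Ssn

EP3(Pnumber, Pname, Plocation), with key Pnumber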

A tip for a 3NF normalization task is to look at the dependencies among the secondary attributes. In example (b), we do the same thing, checking the secondary attributes one by one, and we find that EMP_DEPT is already in 2NF. Because Dnumber is not the primary key of EMP_DEPT, yet Dname and Dmgr_ssn depend on it, we decompose it to reach 3NF.

In this example, we first check for partial FDs and find that County_name → Tax_rate is a partial FD, so we decompose to reach 2NF. Then we find that Area → Price is a transitive FD, so we continue decomposing to reach 3NF.

General Normal Form

All the definitions of 1NF, 2NF, and 3NF above are based on the primary key only. Then, what if there are multiple candidate keys? The following statements broaden the definitions to the more general case.

A relation schema R is in second normal form (2NF) if it is in 1NF, and every nonprime attribute A in R is fully functionally dependent on every candidate key of R. You can refer to the situation of LOTS above.

To be more specific, consider another example:

Then, does this structure violate 2NF? No, it doesn't: although D is not explicitly shown to be fully dependent on the candidate key {B, C}, the dependency A → D by itself is enough.

Boyce-Codd Normal Form(BCNF)

Before introducing BCNF, let's first recap what a superkey is:

Then, the BCNF is defined based on the concept of Superkey:

If we call the left-hand side of an FD (the $\alpha$ attributes) its determinant, BCNF roughly says that every determinant of a nontrivial FD is a superkey; formally, R is in BCNF if for every nontrivial FD $\alpha \rightarrow \beta$ that holds on R, $\alpha$ is a superkey of R.

For example:

in_dep(ID, name, salary, dept_name, building, budget)

If there is one FD: dept_name → building, budget

This in_dep is not in BCNF, because dept_name by itself is not a superkey (a superkey here is {ID, dept_name}). Therefore, to bring in_dep into BCNF, we need to decompose it into instructor and department:

instructor(ID, name, salary, dept_name)

department(dept_name, building, budget)

Decomposing a Schema into BCNF

With the above example, we can now translate what we did into mathematical language. Let R be a schema that is not in BCNF, and let $\alpha \rightarrow \beta$ be the FD that causes the violation, i.e. $\alpha$ is not a superkey of R. To make the decomposed schemas satisfy BCNF while preserving the functional dependencies as far as possible, we first clean up $\alpha\cap\beta$ (we may assume the shared attributes have been removed from $\beta$). Here is again a metaphor: the violation of BCNF is like a tumour on the body of the original schema, and the fix is to cut the offending part $\alpha\cup\beta$ out into its own relation, leaving the rest of R without $\beta-\alpha$.

Therefore, $R$ is decomposed into $(\alpha \cup \beta)$ and $(R - (\beta - \alpha))$.

However, although we try our best to be dependency preserving, i.e. to keep the original functional dependency closure enforceable on the decomposed schemas, we cannot guarantee that a BCNF decomposition is dependency preserving.
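
The loop just described can be sketched in a few lines of Python (a simplified illustration, not the lecture's code: it restricts each given FD to the current fragment rather than projecting the full closure, and the schema and FDs are the in_dep example from above):

```python
def closure(attrs, fds):
    result, changed = set(attrs), True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def bcnf_decompose(R, fds):
    result, done = [set(R)], False
    while not done:
        done = True
        for i, Ri in enumerate(result):
            for lhs, rhs in fds:
                if not lhs <= Ri:
                    continue
                beta = (rhs & Ri) - lhs                    # nontrivial part of the FD inside Ri
                if beta and not Ri <= closure(lhs, fds):   # lhs is not a superkey of Ri: violation
                    result[i] = Ri - beta                  # keep R - (beta - alpha)
                    result.append(lhs | beta)              # split off alpha ∪ beta
                    done = False
                    break
            if not done:
                break
    return result

F = [({"ID"}, {"name", "salary", "dept_name"}),
     ({"dept_name"}, {"building", "budget"})]
print(bcnf_decompose({"ID", "name", "salary", "dept_name", "building", "budget"}, F))
# -> an instructor-like {ID, name, salary, dept_name} and a department-like {dept_name, building, budget}
```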

Third Normal Form Revisited

2NF implies 1NF, and 3NF implies 2NF; then, is BCNF a stricter form of 3NF? The earlier definition of 3NF was stated in terms of transitive functional dependencies, while BCNF is stated in terms of superkeys, so let's restate 3NF in the same style: a relation schema R is in 3NF if, for every FD $\alpha \rightarrow \beta$ in $F^+$, at least one of the following holds: (1) $\alpha \rightarrow \beta$ is trivial, (2) $\alpha$ is a superkey of R, or (3) each attribute in $\beta - \alpha$ is contained in some candidate key of R.

Therefore, if a relation is in BCNF it is in 3NF (since in BCNF one of the first two conditions above must hold). The third condition is a minimal relaxation of BCNF to ensure dependency preservation.

Conversely, a relation that is in 3NF but not in BCNF must have some nontrivial FD $\alpha\rightarrow\beta$ in which $\alpha$ is not a superkey but every attribute of $\beta - \alpha$ is a prime attribute.

For example, schema dept_advisor

dept_advisor(s_ID, i_ID, dept_name)

The two candidate keys are {s_ID, dept_name} and {s_ID, i_ID}.

dept_advisor is not in BCNF, because for the FD i_ID → dept_name the determinant i_ID is not a superkey.

However, it is in 3NF:

i_ID → dept_name and i_ID is NOT a superkey, but dept_name is a prime attribute (it is part of the candidate key {s_ID, dept_name}), so the third 3NF condition holds.

Pros and Cons in 3NF

It is always possible to obtain a 3NF design without sacrificing losslessness or dependency preservation

Consider $R = (J, K, L)$ with $F = \{JK \rightarrow L,\ L \rightarrow K\}$; its candidate keys are $JK$ and $JL$. By the definition of 3NF, this schema is in 3NF (for $L \rightarrow K$, the right-hand side K is a prime attribute).

However, there is repetition of information (the association between L and K is repeated several times), and null values are needed (e.g., to represent the relationship $(l_2, k_2)$ when there is no corresponding value for J).

Pros and Cons in BCNF

There are no nontrivial functional dependencies, and therefore the relation is in BCNF. It still suffers from insertion anomalies: e.g., if we add a phone 981-992-3443 for ID 99999, we need to add two tuples:

(99999, David, 981-992-3443)
(99999, William, 981-992-3443)